Abstract

Finding interactions between variables in large and high-dimensional datasets is often a
serious computational challenge. Most approaches build up interaction sets incrementally,
adding variables in a greedy fashion. The drawback is that potentially informative
high-order interactions may be overlooked. Here, we propose an alternative approach for
classification problems with binary predictor variables, called Random Intersection Trees.
It works by starting with a maximal interaction that includes all variables, and then
gradually removing variables if they fail to appear in randomly chosen observations of a
class of interest. We show that informative interactions are retained with high probability,
and the computational complexity of our procedure is of order $p^\kappa$ for a value of
$\kappa$ that can reach values as low as 1 for very sparse data; in many more general
settings, it will still beat the exponent $s$ obtained when using a brute force search
constrained to order $s$ interactions. In addition, by using some new ideas based on min-wise
hash schemes, we are able to further reduce the computational cost. Interactions found by our
algorithm can be used for predictive modelling in various forms, but they are also often of
interest in their own right as useful characterisations of what distinguishes a certain class
from others.

1 Introduction

In this paper, we consider classification with high-dimensional binary predictors. We suppose
we have data that can be written in the form $(Y_i, X_i)$ for observations $i = 1, \ldots, n$;
$Y_i$ is the class label and $X_i \subseteq \{1, \ldots, p\}$ is the set of active predictors
for observation $i$ (out of a total of $p$ predictors). An important example of this type of
problem is that of text classification, where $X_i$ is the set of frequently appearing words
(in a suitable sense) for document $i$, and $Y_i$ indicates whether the document belongs to a
certain class. In this case, the dimension $p$ can be of the order of several thousand or more.
More generally, if data with continuous predictors are available, they can be converted to
binary format by choosing various split-points, and then reporting whether or not each variable
exceeds each of these thresholds.

Our aim here is to develop methodology that can discover important interaction terms
in the data without requiring that any of their lower order interactions are also informative.
More precisely, we are interested in finding subsets $S \subseteq \{1, \ldots, p\}$ of all
predictor variables that occur more often for observations in a class of interest than for
other observations. We will use the terms "leaf nodes", "rules", "patterns" and "interactions"
interchangeably to describe such subsets $S$. For simplicity, suppose there are only two
classes, the set of labels being $\{0, 1\}$. The case with more than two classes can be dealt
with using one-versus-one or one-versus-all strategies. Given a pair of thresholds
$0 \le \theta_0 < \theta_1 \le 1$, our goal is to find all sets $S$ (or as many as possible)
for which

\[ \mathbb{P}_n(S \subseteq X \mid Y = 1) \ge \theta_1 \quad \text{and} \quad \mathbb{P}_n(S \subseteq X \mid Y = 0) \le \theta_0. \tag{1.1} \]

Here and throughout the paper, we use the subscript $n$ to indicate that the probabilities are
empirical probabilities. For example, for $c \in \{0, 1\}$,

\[ \mathbb{P}_n(S \subseteq X \mid Y = c) := \frac{1}{|C_c|} \sum_{i \in C_c} \mathbb{1}_{\{S \subseteq X_i\}}, \]

where we have denoted the set of observations in class $c$ by $C_c$. Of course, one would also
be interested in sets $S$ which satisfy a version of (1.1) with classes 1 and 0 interchanged,
but we will only consider (1.1) for simplicity.

The interaction terms uncovered can be used in various ways. For example, they can be 
built into tree-based methods, or form new features in linear or logistic regression models. The 
interactions may also be of interest in their own right, as they can characterise distinctions 
between classes in a simple and interpretable way. These potentially high-order interactions 
that our method aims to target would be very difficult to discover using existing methods, as 
we now explain. 

A pure brute force search examines each potential interaction $S$ of a given size to check
whether it fulfills (1.1). Restricting the order of interactions to size $s$, the computational
complexity scales as $p^s$, rendering problems with even moderate values of $p$ infeasible.

Instead of searching through every possible interaction, tree-based methods build up
interactions incrementally. A typical tree classifier such as CART (Breiman et al., 1984) works
by building a decision tree greedily from root node to the leaves; see also Loh and Shih.
The feature space is recursively partitioned based on the variable whose presence or absence
best distinguishes the classes. The myopic nature of this strategy makes it a computationally
feasible approach, even for very large problems. The downside is that it produces rather
unstable results and hence gives poor predictive performance. Moreover, because of the
incremental way in which interactions are constructed, the success of this strategy in
recovering an important interaction $S$ rests on at least some of its lower order interactions
being informative for distinguishing the classes.

Approaches based on tree ensembles can somewhat alleviate the problem of tree instability;
Random Forests (Breiman, 2001) is a prominent example. Here the data with which the
decision trees are constructed are sampled with replacement from the original data. Further
randomness is introduced by randomising over the subset of variables considered for each split
in the construction of the trees. While the results of Random Forests are very complex and
hard to interpret, one can examine what are known as variable importance measures. These
aim to quantify the marginal or pairwise importance of predictor variables (Strobl et al.,
2008). Though such measures can be useful, they may fail to highlight important high-order
interactions between variables.

More recently, there has been interest in algorithms that start from deep splits or leaf
nodes in trees and then try to build a simpler model out of many thousands of these leaves
[...] and the general framework of Decision Lists [...]. Though these methods have been
demonstrated to improve on Random Forests in some situations, they nevertheless crucially rely
on a good initial basis of leaf nodes. These bases are usually generated by tree ensemble
methods and so, if the base trees miss some important splits, they would also be absent in the
results of these derivative algorithms.

A complementary approach has developed in data mining under the name of frequent
itemset search, starting with the Apriori algorithm (Agrawal et al., 1994), which has since
then developed into many improved and more specialised forms. The starting point for these
was "market basket analysis", where the shopping behaviour of customers is analysed and
the goal is to identify baskets that are often bought together. While generally very successful,
these methods work on the principle that subsets of frequent itemsets are also frequent. Thus
if there is no strong marginal effect of any of the variables that are involved in a decision
rule, there is no advantage in general compared to a brute force search.

We now give a simple example where tree-based approaches and those based on the Apriori
algorithm will struggle. Suppose our data are independent realisations of the pair of random
variables $(X, Y)$, whose distribution is given as follows. Let $Z$ be the random binary vector

\[ Z = (\mathbb{1}_{\{1 \in X\}}, \ldots, \mathbb{1}_{\{p \in X\}})^T. \]

Suppose that, conditional on $\{Y = 0\}$, the $Z_k$ for $k = 1, \ldots, p$ are independent; and
conditional on $\{Y = 1\}$, the $Z_k$ for $k = 2, \ldots, p$ are independent and $Z_1 = Z_2$.
We take the marginal distributions of the $Z_k$ to all be Bernoulli($q_Z$). Finally, let $Y$
have a Bernoulli($q_Y$) distribution. Then the interaction $S = \{1, 2\}$ is certainly
important for distinguishing the classes. However, no individual variable in
$\{1, \ldots, p\}$ appears more or less frequently on average among class 1 compared to
class 0. This lack of any marginal relationship between the class label and the first two
predictors would cause tree-based methods to perform poorly. In addition, using the Apriori
algorithm or the brute force method to find $S$ would require computational cost of the
order $p^2$.
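To make the example concrete, the following is a minimal simulation sketch of the distribution just described. The function names `simulate` and `prevalence`, the parameter names `q_z` and `q_y`, and the numerical settings are illustrative assumptions, not part of the paper. It prints the empirical prevalence of each candidate set in class 1 and in class 0.

```python
import random

def simulate(n, p, q_z, q_y, seed=0):
    """Draw n pairs (y, x); x is the set of active predictors, labelled 1..p."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        y = 1 if rng.random() < q_y else 0
        z = [1 if rng.random() < q_z else 0 for _ in range(p)]
        if y == 1:
            z[0] = z[1]  # in class 1, Z_1 is a copy of Z_2
        data.append((y, frozenset(k + 1 for k in range(p) if z[k])))
    return data

def prevalence(data, s, c):
    """Empirical probability that S is contained in X, among observations with Y = c."""
    rows = [x for y, x in data if y == c]
    return sum(s <= x for x in rows) / len(rows)

# Hypothetical settings chosen only for illustration.
data = simulate(n=20000, p=10, q_z=0.3, q_y=0.5)
for s in (frozenset({1}), frozenset({2}), frozenset({1, 2})):
    print(sorted(s), round(prevalence(data, s, 1), 3), round(prevalence(data, s, 0), 3))
```

Under these assumptions the single variables should show nearly identical prevalence in both classes, while the pair {1, 2} should be markedly more common in class 1, which is the situation in which greedy, marginal-driven searches have nothing to latch on to.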

This paper looks at a new way to discover interactions, which we call Random Intersection
Trees. Rather than searching through potential interactions directly, our method works by
looking for collections of observations whose common active variables together form
informative interactions. We present a basic version of the Random Intersection Trees algorithm
in the following section. This approach allows for computationally feasible discovery of
interactions in settings where most existing procedures would perform poorly. Bounds on the
complexity of our algorithm are given in Section 3. For example, our results yield that in
the scenario discussed in the previous paragraph, the order of computational complexity of
our method is at most $o(p^\kappa)$ for any $\kappa > 1$. In Section 4 we propose some
modifications of our basic method to reduce its computational cost, based on min-wise hash
schemes. Some numerical examples are given in Section 5. We conclude with a brief discussion
in Section 6, and all technical proofs are collected in the appendix.

2 Random Intersection Trees 

Our method searches for important interactions by looking at intersections of randomly chosen
observations from class 1. We start with the full set of variables, and then remove those that
are not present in the observations chosen. If a pattern $S$ has high prevalence in class 1,
i.e. $\mathbb{P}_n(S \subseteq X \mid Y = 1)$ is large, it will be included in the observations
chosen with high probability. Thus, provided the overall process is repeated often enough, $S$
is likely to be retained in at least one of the final intersections. One could then consider
each of these intersections as possible solutions of (1.1), checking whether their prevalence
among class 0 is below $\theta_0$.

Arranging the procedure in a tree-type search makes the algorithm more computationally
efficient. To describe the details, we first define some terms associated with trees that will be
needed later. Recall that a tree is a pair $(N, E)$ of nodes and edges forming a connected
acyclic (undirected) graph. We will always assume (with no loss of generality) that
$N = \{1, \ldots, |N|\}$. A rooted tree is the directed acyclic graph obtained from a tree by
designating one node as root and directing all edges away from this root.

Let $\alpha$ and $\beta$ be two nodes in a rooted tree, with $\beta$ not the root node. If
$(\alpha, \beta) \in E$, $\beta$ is said to be the child of $\alpha$, and $\alpha$ the parent
of $\beta$. We will denote by $\mathrm{ch}(\alpha)$ the set of children of a node $\alpha$.
Since we are only considering rooted trees here as opposed to general directed graphs, we will
depart from convention slightly and use $\mathrm{pa}(\beta)$ to mean the unique parent of
$\beta$. Thus here, $\mathrm{pa}(\beta)$ is a node itself, whereas $\mathrm{ch}(\alpha)$ is a
set of nodes.

If $\alpha \ne \beta$ lies on the unique path from the root to $\beta$, we say $\alpha$ is an
ancestor of $\beta$, and $\beta$ is a descendant of $\alpha$. Writing $\mathrm{an}(\alpha)$ for
the set of ancestors of $\alpha$, the depth of $\alpha$, denoted $\mathrm{depth}(\alpha)$, is
the number of ancestors of $\alpha$: $\mathrm{depth}(\alpha) = |\mathrm{an}(\alpha)|$. In
particular, the depth of the root node is 0. The depth (also known as the height) of a rooted
tree is the length of the longest path, or equivalently, the greatest number of ancestors of
any particular node. By level $d$ of the tree, we will mean the set of nodes with depth $d$.

We will say an indexing of the nodes is chronological if, for every parent and child pair, 
larger indices are assigned to the child than the parent. In particular, the root node will be 1. 
Note that both depth-first and breadth-first indexing methods are chronological in this way. 

Algorithm 1 A basic version of Random Intersection Trees

for tree m = 1 to M do
    Construct rooted tree m of depth D. Each node in levels 0, ..., D - 1 has B children
    and B can be random. Let J be the total number of nodes in the tree, and index the
    nodes in a chronological way.
    Set $S_1$ to be a randomly chosen observation from class 1.
    for node j = 2 to J do
        Draw a random observation i(j) from class 1.
        Set $S_j = X_{i(j)} \cap S_{\mathrm{pa}(j)}$.
    end for
    Denote the collection of resulting sets of all nodes at depth d, for d = 1, ..., D, by
    $L_{d,m} = \{S_j : \mathrm{depth}(j) = d\}$.
end for
return $L_D := \bigcup_{m=1}^{M} L_{D,m}$.



Algorithm 1 describes a basic version of the Random Intersection Trees procedure. We see
that each node in each tree is associated with a randomly drawn observation from class 1.
For every tree, we visit each non-root node in turn, and compute the intersection of the
observation assigned to it, and all those assigned to its ancestors. Because of the way the
nodes are indexed, parents are always visited before their children, and this intersection can
simply be computed as $S_j = X_{i(j)} \cap S_{\mathrm{pa}(j)}$. This is crucial to reducing
the computational complexity of the procedure, as we shall see in the next section.
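The procedure can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions, not the authors' implementation: the branching factor `B` is fixed rather than random, each tree is built level by level (a breadth-first, hence chronological, order), and the function name and arguments are hypothetical.

```python
import random

def random_intersection_trees(class1, M, D, B, seed=0):
    """Return the distinct intersections found at depth D over M trees.

    class1: list of sets, the active predictors of each class 1 observation.
    """
    rng = random.Random(seed)
    candidates = set()
    for _ in range(M):
        # Root node: a randomly chosen class 1 observation.
        level = [frozenset(rng.choice(class1))]
        for _ in range(D):
            # Each node spawns B children; a child intersects its parent's
            # set with a freshly drawn class 1 observation.
            level = [parent & rng.choice(class1) for parent in level for _ in range(B)]
        candidates.update(level)
    return candidates

# Toy data: every class 1 observation contains the pattern {1, 2}.
class1 = [{1, 2, 3}, {1, 2, 4}, {1, 2, 5}]
print(random_intersection_trees(class1, M=20, D=3, B=2))
```

Because each child only intersects with its parent's (already reduced) set, the work per node shrinks as the depth grows, which is the point made in the paragraph above.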

Together, the sets assigned to the leaf nodes of each of the trees yield a collection of
potential candidate interactions, $L_D$. One could then proceed to test these as potential
solutions to (1.1); we present a more efficient approach in Section 4, where we build this
testing step into the construction of the trees. An illustration of this improved algorithm
applied to the Tic-Tac-Toe data discussed in Section 5 is given in Figure 1. Here, the root
node contains a randomly drawn final win-state for black (class 1). This corresponds to $S_1$
in our algorithm. For each other node $j$, the randomly chosen additional black-win state
$X_{i(j)}$ is shown along the edge from its parent node and the new intersection $S_j$ in the
corresponding node. The early stopping that is added in the improved algorithm also allows the
algorithm to run until it has terminated in all nodes, so no prior specification of the tree
depth is necessary in practice, as will be shown in Section 4.

3 Computational complexity 

How many trees do we have to compute to have a very high probability of finding an
interesting interaction $S$ that fulfills (1.1)? And what is the required size of these trees?
If the interaction is not associated with a main effect, most approaches like trees and
association rules would require of order $p^s$ searches. In this section, we show that in many
settings, Random Intersection Trees improves on this complexity. We consider a single
interaction $S$ of size $s = |S|$, and examine the computational cost for returning $S$ as one
of the candidate interactions, with a given probability. We will see that this depends
critically on three factors:

• Prevalence $\theta_1 := \mathbb{P}_n(S \subseteq X \mid Y = 1)$ of the interaction pattern.
  If the pattern $S$ in question appears frequently in class 1, the search is more efficient.

• Sparsity $\delta_k := \mathbb{P}_n(k \in X \mid Y = 1)$ of the predictor variables
  $k = 1, \ldots, p$. If $\delta_k$ is very low for many $k$ (and sparsity of predictors
  consequently high), computation of the intersections is much cheaper, and so overall
  computational cost is greatly reduced. Indeed, for a fixed tree $m$, consider a node $j$
  with depth $d < D$. We have that

  \[ \mathbb{E}(|S_j|) = \sum_{k=1}^{p} \delta_k^{d+1}. \]

  Thus, for $j' \in \mathrm{ch}(j)$, computation of $S_{j'}$ requires on average at most

  \[ O\Big(\log(p) \sum_{k=1}^{p} \delta_k^{d+1}\Big) \]

  operations. This is because in order to compute the intersection, one can check whether
  each member of $S_j$ is in $X_{i(j')}$, and each such check is $O(\log(p))$ if the sets
  $X_i$ are ordered so a binary search can be used. If we compare this to the $O(p)$
  computations required to calculate each of the $S_j$ if no tree structure were used, we see
  that large efficiency gains are possible when $d \ge 1$ if many variables are sparse. For
  intersections with the root node, the tree structure offers no advantage, and in practice,
  branching the tree only after level 1 (so the root node has only one child) is more
  efficient, though this modification does not improve the order of complexity.

• Independence of $S$. Define
  $\nu := \max_{k \in S^c} \mathbb{P}_n(k \in X \mid S \subseteq X, Y = 1)$. If $\nu$ is low,
  less computational effort is required to recover $S$. Note that if, for some $k \in S^c$,
  $\mathbb{P}_n(k \in X \mid S \subseteq X) = 1$, interest would centre on $S \cup \{k\}$
  rather than $S$ itself. Indeed, if $S$ satisfied (1.1), so would $S \cup \{k\}$. In general,
  if $\nu$ is large, the search will tend to find sets containing $S$, though not necessarily
  $S$ itself.

Figure 1: An intersection tree. Starting with a randomly chosen class 1 (black wins)
observation at the root node, $B = 4$ randomly chosen class 1 observations are intersected with
the pattern. These randomly chosen observations are shown along the edges and the resulting
intersections $S_j$ as the nodes in the next layer of the tree. Nodes are only shown if the
corresponding patterns $S_j$ have an estimated prevalence among class 0 below a set threshold;
the branching of the tree terminates for all other nodes. The algorithm continues until all
resulting $S_j$ corresponding to the leaf nodes have prevalence among class 0 exceeding the
threshold. Here, one of the winning states for black is filtered out after three intersections.
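The effect of sparsity on the size of the intersections can be checked with a small worked example. The sketch below evaluates the expected size of an intersection at depth d as the sum over variables of the (d+1)-th power of each variable's class 1 frequency; the function name and the numerical values are illustrative assumptions.

```python
def expected_intersection_size(deltas, d):
    """Expected number of variables surviving in a node at depth d.

    deltas: class 1 frequency of each predictor variable.
    A node at depth d is the intersection of d + 1 independently drawn
    class 1 observations, so variable k survives with probability delta_k ** (d + 1).
    """
    return sum(delta ** (d + 1) for delta in deltas)

# Hypothetical setting: p = 1000 variables, each active in 10% of class 1.
deltas = [0.1] * 1000
for d in range(4):
    print(d, expected_intersection_size(deltas, d))
```

In this hypothetical setting the expected set size falls by a factor of ten with every level, so intersections below the first level touch only a handful of variables rather than all p.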

With the assumptions that $\theta_1 > 0$ and $\nu < 1$, we can give a bound on the
computational complexity of the basic version of Random Intersection Trees introduced in the
previous section.

Theorem 1. For suitable choices of $M$, $D$ and the distribution of $B$, the expected order of
computations needed for $L_D$ to contain $S$ with probability $1 - \eta < 1$ is bounded above
by the minimum over $\epsilon \in (0, 1]$ of

logs/ \ (- logU1+e)5k / ei } -I 

iog(i/r?) & y 1 <p+ 22 p los(1/v) r- t 3 - 1 ) 

^ fc:(l+e)<5 fc >01 ^ 

As a function of the number of variables $p$, there is a contribution of $p \log^2(p)$ and an
additional contribution in the brackets that depends on the sparsity $\delta_k$ of each
variable. Sparse variables do not contribute to this sum and the sum can be arbitrarily close
to 1 if the sparsity among variables is high enough. This would yield a computational
complexity with order bounded above by $o(p^\kappa)$ for any $\kappa > 1$, compared to the
corresponding complexity of $p^s$ for a brute force search. In most interesting settings,
however, we would not achieve a nearly linear scaling in complexity, but would hope to still be
faster than a brute force search.

The influence of sparsity on computational complexity. It is interesting to make the
influence of the sparsity of individual variables, $\delta_k$, on the overall computational
complexity more explicit. We have the following corollary to Theorem 1.

Corollary 2. Define $\beta$ by $\nu = \theta_1^{\beta}$. Suppose that $\gamma$, $\alpha^*$,
$\alpha_*$ are such that $\alpha^* > \alpha_*$, and

\[ \delta_k \le \theta_1^{1 - \alpha^*} \quad \text{for all } k \in \{1, \ldots, p\}, \]
\[ \delta_k > \theta_1^{1 - \alpha_*} \quad \text{for at most } O(p^{\gamma}) \text{ variables.} \]

For suitable choices of $M$, $D$ and the distribution of $B$, the expected order of
computations needed for $L_D$ to contain $S$ with probability $1 - \eta < 1$ is bounded above
by

\[ o(p^{\kappa}) \quad \text{for any } \kappa > \max\Big\{ \frac{\alpha^*}{\beta} + \gamma, \; \frac{\alpha_* + \beta}{\beta} \Big\}. \]

The implication of Corollary 2 is most apparent if we take $\gamma = 1$, as we can then set
$\alpha_* = 0$. In this case,

\[ \alpha^* = 1 - \frac{\log(\max_k \delta_k)}{\log(\theta_1)}. \]

We can then bound the computational complexity by

\[ o(p^{\kappa}) \quad \text{for any } \kappa > 1 + \frac{\log(\max_k \delta_k / \theta_1)}{\log(1/\nu)}. \tag{3.3} \]

The fraction on the right-hand side is a function of the prevalence of the pattern $S$,
$\theta_1$, the maximum sparsity of the variables, and the maximum sparsity of the variables in
$S^c$, conditional on the presence of $S$. As long as this fraction is less than 1, the
computational complexity is guaranteed to be better than a brute force search with the
knowledge that $s = 2$, and the relative advantage grows for larger sizes of the pattern.



Independent noise variables. To gain further insight, we consider the special case where
variables in $S^c$ are independent of $S$ (conditional on being in class 1), in the sense that
for all $k \in S^c$,

\[ \mathbb{P}_n(k \in X \mid S \subseteq X, Y = 1) = \mathbb{P}_n(k \in X \mid Y = 1) = \delta_k. \tag{3.4} \]

Corollary 3. Assume (3.4) and that $\delta_k < 1$ for all $k$. Define
$\tau := \log(\theta_1) / \log(\max_k \delta_k)$. Then for suitable choices of $M$, $D$ and the
distribution of $B$, the expected order of computations needed for $L_D$ to contain $S$ with
probability $1 - \eta < 1$ is bounded above by

\[ o(p^{\kappa}) \quad \text{for any } \kappa > \tau. \tag{3.5} \]

We see that the computational complexity is approximately linear in p if the prevalence 
of the pattern S is as high as the prevalence of the least sparse predictor variables. 



We can also consider the situation where, in addition to the independence (3.4), all variables
have the same sparsity $\delta$. If the prevalence $\theta_1$ of $S$ is only as high as that of
a random occurrence of two independent predictor variables, we get $\tau = 2$ and the
computational complexity is quadratic in $p$. In this case, the algorithm would not yield a
computational advantage over brute force search if looking for patterns of size 2. This is to
be expected since every pattern $S$ of size 2 would have the same prevalence in this scenario,
and so there is nothing special about a pattern $S$ of size 2 with prevalence $\delta^2$, and
in general no hope of beating the complexity $p^s$ of a brute force search. However, the bound
in (3.5) is independent of $s$. Thus provided the prevalence $\theta_1$ drops more slowly than
the rate $\delta^s$, at which every pattern of size $s$ would occur randomly among independent
predictor variables, our results show that Random Intersection Trees is still to be preferred
over a brute force search.
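As a quick numerical illustration of the exponent in Corollary 3, the ratio of the log prevalence of the pattern to the log of the largest variable frequency can be evaluated directly. The function name and the example values below are assumptions chosen for illustration only.

```python
import math

def complexity_exponent(theta1, delta_max):
    """Exponent threshold log(theta1) / log(delta_max) from Corollary 3.

    theta1: prevalence of the pattern S in class 1 (0 < theta1 <= 1).
    delta_max: largest class 1 frequency of any single variable (0 < delta_max < 1).
    """
    return math.log(theta1) / math.log(delta_max)

delta = 0.1  # hypothetical common sparsity of all variables
for theta1 in (0.1, 0.03, 0.01):
    print(theta1, round(complexity_exponent(theta1, delta), 3))
```

With these values the exponent moves from 1 (pattern as frequent as a single variable) to 2 (pattern as frequent as a chance co-occurrence of two variables). A pattern of size 4 with prevalence 0.03 would sit at roughly 1.5, well below the exponent 4 of a brute force search over four-way interactions.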



4 Early stopping using min-wise hashing 

While Algorithm 1 is computationally attractive, the following observation suggests that
further improvements are possible. Suppose that, for a particular tree, we have just computed
the intersection $S_j$ corresponding to a node $j$ at depth $d < D$. If

\[ \mathbb{P}_n(S_j \subseteq X \mid Y = 0) > \theta_0, \]

then since $S_{j'} \subseteq S_j$ for all $j' \in \mathrm{de}(j)$, the set of descendants of
$j$, we also have

\[ \mathbb{P}_n(S_{j'} \subseteq X \mid Y = 0) > \theta_0. \]

Thus no intersection sets corresponding to descendants of $j$ have any hope of yielding
solutions to (1.1), and so all further associated computations are wasted.

In view of this, one option would be to compute the quantity
$\mathbb{P}_n(S_j \subseteq X \mid Y = 0)$ at each node $j$ as the algorithm progresses, and if
this exceeds the threshold $\theta_0$, not visit any descendants $j'$ of $j$ for computation of
$S_{j'}$. This could be prohibitively costly, though, as it would require a pass over all
observations in class 0, for each node of each tree. One could work with a subsample of the
observations, but if $\theta_0$ is low, the subsample size may need to be fairly large in order
to estimate the probabilities to a sufficient degree of accuracy.

Instead, we propose a fast approximation, using some ideas based on min-wise hashing
(Broder et al., 1998; Cohen et al., 2001; Datar and Muthukrishnan, 2002) applied to the
columns of the data-matrix. We describe the scheme by leaving aside the conditioning on
$Y = 0$, which can be added at the end by restricting to observations in class 0. Consider
taking a random permutation $\sigma$ of all observations $\{1, \ldots, n\}$. Let $h_\sigma(k)$
be the minimal value $i$ such that variable $k$ is active in observation $\sigma(i)$:

\[ h_\sigma(k) = \min\{ i : k \in X_{\sigma(i)} \}. \]



It is well known (Broder et al., 1998) that the probability that $h_\sigma(k)$ and
$h_\sigma(k')$ agree for two variables $k, k'$ under a random permutation $\sigma$ is identical
to the Jaccard-index for the two sets $I_k = \{i : k \in X_i\}$ and
$I_{k'} = \{i : k' \in X_i\}$, that is

\[ \mathbb{P}_\sigma\big(h_\sigma(k) = h_\sigma(k')\big) = \frac{|I_k \cap I_{k'}|}{|I_k \cup I_{k'}|}. \]

Here the subscript $\sigma$ indicates that the probability is with respect to a random
permutation $\sigma$ of the observations. A min-wise hash scheme is typically used to estimate
the Jaccard-index by approximating the probability on the left-hand side of the equation above.
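The relation between min-wise hash agreement and the Jaccard-index can be checked empirically. The sketch below is a minimal illustration with hypothetical function names: it draws L random permutations of n observations and reports the fraction of permutations under which two variables share the same min-wise hash value.

```python
import random

def minhash_value(active_obs, position):
    """Smallest permuted position among the observations in which the variable is active."""
    return min(position[i] for i in active_obs)

def estimate_jaccard(obs_k, obs_k2, n, L, seed=0):
    """Fraction of L random permutations under which the two hash values agree.

    obs_k, obs_k2: sets of observation indices (0..n-1) in which each variable is active.
    """
    rng = random.Random(seed)
    agree = 0
    for _ in range(L):
        order = list(range(n))
        rng.shuffle(order)  # order[i - 1] plays the role of sigma(i)
        position = [0] * n
        for pos, obs in enumerate(order, start=1):
            position[obs] = pos
        agree += minhash_value(obs_k, position) == minhash_value(obs_k2, position)
    return agree / L

# Toy example: the two index sets share 30 of 90 distinct observations.
print(estimate_jaccard(set(range(60)), set(range(30, 90)), n=100, L=2000))
```

For the toy sets above the Jaccard-index is 30/90, so the printed estimate should lie near one third, with accuracy improving as L grows.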
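Now,

\[ \mathbb{P}_n(S \subseteq X) = \mathbb{P}_n(k \in X \text{ for all } k \in S) \]
\[ \qquad = \mathbb{P}_n(k \in X \text{ for all } k \in S \mid \exists\, k' \in S \text{ such that } k' \in X) \times \mathbb{P}_n(\exists\, k \in S \text{ such that } k \in X). \]

Let us denote the first and second terms on the right-hand side by $\pi_1(S)$ and $\pi_2(S)$
respectively. Note that $\pi_1(S)$ is equal to the probability that all variables $k \in S$
have the same min-wise hash value $h_\sigma(k)$:

\[ \pi_1(S) = \mathbb{P}_\sigma(\exists\, i : h_\sigma(k) = i \text{ for all } k \in S). \tag{4.1} \]

Turning now to $\pi_2(S)$, observe that

\[ \mathbb{E}_\sigma\Big(\min_{k \in S} h_\sigma(k)\Big) = \frac{n + 1}{\pi_2(S)\, n + 1}, \tag{4.2} \]

and so

\[ \pi_2(S) = \frac{n + 1}{n} \Big( \frac{1}{\mathbb{E}_\sigma(\min_{k \in S} h_\sigma(k))} - \frac{1}{n + 1} \Big). \tag{4.3} \]

A derivation of (4.2) is given in the appendix.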



Equations (4.1) and (4.3) provide the basis for an estimator of $\mathbb{P}_n(S \subseteq X)$.
First we generate $L$ random permutations of $\{1, \ldots, n\}$: $\sigma_1, \ldots, \sigma_L$.
We then use these to create an $L \times p$ matrix $H$ whose entries are given by

\[ H_{lk} = h_{\sigma_l}(k). \]

Now we estimate $\pi_1(S)$ and $\pi_2(S)$ by their respective finite-sample approximations,
$\hat{\pi}_1(S)$ and $\hat{\pi}_2(S)$:

\[ \hat{\pi}_1(L; S, H) := \frac{1}{L} \sum_{l=1}^{L} \mathbb{1}_{\{H_{lk} = H_{lk'} \text{ for all } k, k' \in S\}}, \]

\[ \hat{\pi}_2(L; S, H) := \frac{n + 1}{n} \Big[ \frac{1}{\frac{1}{L} \sum_{l=1}^{L} \min_{k \in S} H_{lk}} - \frac{1}{n + 1} \Big]. \]

Finally, we estimate $\mathbb{P}_n(S \subseteq X)$ by

\[ \hat{\mathbb{P}}_n(L; S, H) := \hat{\pi}_1(S, H) \cdot \hat{\pi}_2(S, H). \tag{4.4} \]
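A minimal sketch of this estimator follows. It builds the min-wise hash matrix from L random permutations and then combines the two finite-sample approximations as a product. The function names, the 0-based variable labels, and the sentinel value n + 1 used for variables that are never active are assumptions made for the illustration.

```python
import random

def minhash_matrix(rows, p, L, seed=0):
    """L x p matrix H; H[l][k] is the first permuted position at which variable k is active.

    rows: list of sets of active variables, labelled 0..p-1.
    Variables that are never active get the sentinel n + 1 (an assumption of this sketch).
    """
    rng = random.Random(seed)
    n = len(rows)
    H = []
    for _ in range(L):
        order = list(range(n))
        rng.shuffle(order)
        h = [n + 1] * p
        for pos, obs in enumerate(order, start=1):
            for k in rows[obs]:
                if h[k] > pos:
                    h[k] = pos
        H.append(h)
    return H

def estimate_prevalence(S, H, n):
    """Estimate the fraction of observations containing every variable in S."""
    S = list(S)
    L = len(H)
    # First factor: share of permutations in which all hash values in S coincide.
    pi1 = sum(all(h[k] == h[S[0]] for k in S) for h in H) / L
    # Second factor: derived from the average smallest hash value over S.
    mean_min = sum(min(h[k] for k in S) for h in H) / L
    pi2 = (n + 1) / n * (1.0 / mean_min - 1.0 / (n + 1))
    return pi1 * pi2

# Toy data: 50 of 200 observations contain both variable 0 and variable 1.
rows = [{0, 1}] * 50 + [{0}] * 50 + [{2}] * 100
H = minhash_matrix(rows, p=3, L=2000)
print(estimate_prevalence({0, 1}, H, n=len(rows)))
```

Once the matrix has been computed, each query only scans L rows of it, regardless of the number of observations. In the toy data the true prevalence is 0.25, and the printed estimate should be close to that value.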

To our knowledge, this use of min-wise hashing techniques, and in particular the estimator
$\hat{\pi}_2(S)$, is new. The estimator enjoys reduced variance compared to that which would be
obtained using subsampling, as the following theorem shows.



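Theorem 4. For $\hat{\mathbb{P}}_n(L; S, H)$, $\pi_1(S)$ and $\pi_2(S)$ defined as in (4.4),
(4.1) and (4.3) respectively,

\[ \sqrt{L}\,\big(\hat{\mathbb{P}}_n(L; S, H) - \mathbb{P}_n(S \subseteq X)\big) \xrightarrow{d} N\big(0, \; \pi_2(S)^2 \pi_1(S)(1 - \pi_1(S)\pi_2(S)) + o_n(1)\big). \tag{4.5} \]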

A derivation is given in the appendix. Here the subscript $n$ in $o_n(1)$ is used to indicate
that $n$ is tending to infinity. Comparing the variance of the normal distribution in (4.5) to
that which would be obtained if subsampling,
$\pi_2(S)\pi_1(S)(1 - \pi_1(S)\pi_2(S)) + o_n(1)$, we see that a factor of $\pi_2(S)$ is
gained: for subsampling to match the accuracy of the min-wise hash scheme, roughly
$1/\pi_2(S)$ times as many samples would be required. By using min-wise hashing, choosing
$L = 100$ typically delivers a reasonable approximation as long as we just want to resolve
values at $\theta_0 = 0.01$ and above.

An improved version of Algorithm 1, building in the ideas discussed above, is given in
Algorithm 2 below. Note that $\hat{\mathbb{P}}_n(S_{\mathrm{pa}(j)}, H)$ need only be computed
once for every $j$ with the same parent.

Early stopping decreases the computational cost of the algorithm as many nodes in the
trees generated may not need to have their associated intersections calculated. In addition,
the set of candidate intersections $L_D$ will be smaller but the chance of it containing
interesting intersections would not decrease by much. These gains come at a small price, since
the min-wise hash matrix $H$ must be computed, and the computational effort going into this
will in turn determine the quality of the approximation in (4.4). We have previously shown the
complexity bounds in the absence of early stopping and thus avoided the difficulty of making
this trade-off explicit. We will use the improved version of Random Intersection Trees with
early stopping in all the practical examples to follow, taking small values of $L$ in the range
of a few hundred permutations.

The depth $D$ of the tree is still given explicitly in Algorithm 2. An interesting modification
creates the tree recursively. Starting with the root node, $B$ children are added to all leaf
nodes of the tree in which the early stopping criterion has not been triggered yet. When the
algorithm terminates, all intersections in the leaf nodes of the final tree are collected.

Algorithm 2 Random Intersection Trees with early stopping

Compute the L x p min-wise hash matrix H, using only class 0 observations.
for tree m = 1 to M do
    Construct rooted tree m. Each node in levels 0, ..., D - 1 has B children and B can
    be random. Let J be the total number of nodes in the tree, and index the nodes in a
    chronological way.
    Set $S_1$ to be a randomly chosen observation from class 1.
    for node j = 2 to J do
        if $\hat{\mathbb{P}}_n(S_{\mathrm{pa}(j)}, H) \le \theta_0$ then
            Draw a random observation i(j) from class 1.
            Set $S_j = X_{i(j)} \cap S_{\mathrm{pa}(j)}$.
        end if
    end for
    Denote the collection of resulting sets of all nodes at depth d, for d = 1, ..., D, by
    $L_{d,m} = \{S_j : \mathrm{depth}(j) = d\}$.
end for
return $L_D := \bigcup_{m=1}^{M} L_{D,m}$.
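The control flow of the early stopping rule can be sketched as follows. To keep the example self-contained, the class 0 prevalence estimate is passed in as a callable, `prevalence0`, which is a hypothetical stand-in for the min-wise hash estimate; the fixed branching factor and the function names are likewise assumptions of this sketch rather than the authors' implementation.

```python
import random

def rit_early_stopping(class1, prevalence0, theta0, M, D, B, seed=0):
    """Random Intersection Trees where children are only computed for admissible parents.

    class1: list of sets of active predictors for class 1 observations.
    prevalence0: callable giving an (estimated) class 0 prevalence of a set.
    """
    rng = random.Random(seed)
    candidates = set()
    for _ in range(M):
        level = [frozenset(rng.choice(class1))]
        for _ in range(D):
            children = []
            for parent in level:
                # One prevalence check per parent serves all of its B children.
                if prevalence0(parent) > theta0:
                    continue  # early stop: no descendant can satisfy the class 0 bound
                children.extend(parent & rng.choice(class1) for _ in range(B))
            level = children
        candidates.update(level)
    return candidates

# Toy data; the exact class 0 prevalence stands in for the hashed estimate.
class1 = [{1, 2, 3}, {1, 2, 4}, {1, 2, 5}]
class0 = [{1, 3}, {2, 4}, {3, 5}, {4, 5}]
exact0 = lambda s: sum(s <= x for x in class0) / len(class0)
print(rit_early_stopping(class1, exact0, theta0=0.0, M=10, D=2, B=2))
```

Passing the estimator as a function keeps the tree construction independent of how the class 0 prevalence is approximated, so an exact count, a subsample, or a hash-based estimate can be swapped in without touching the loop.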



5 Numerical Examples 

In this section, we give two numerical examples to provide further insight into the performance 
of our method. The first is about learning the winning combinations for the well-known game 
Tic-Tac-Toe. This example serves to illustrate how Random Intersection Trees can succeed in 
finding interesting interactions when other methods fail. The second example concerns text 
classification. Specifically, we want to find simple characterisations (using only a few words, 
or word-stems in this case) for classes within a large corpus in a large-scale text analysis 
application. 



5.1 Tic-Tac-Toe endgame prediction 



The Tic-Tac-Toe endgame dataset (Matheus and Rendell, 1989; Aha et al., 1991) contains
all possible winning end states of the game Tic-Tac-Toe, along with which player (white or
black) has won for each of these. There are just under 1000 possible such end states, and our
goal is to learn the rules that determine which player wins from a randomly chosen subset of
these. We use half of the observations for training, and the other half for testing.

There are 9 variables in the original dataset which can take the values 'black', 'white' or 
'blank'. These can trivially be transformed into a set of twice as many binary variables where 
the first block of variables encodes presence of black and the second block encodes presence 
of white. 
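The transformation into binary variables can be sketched as below. The cell ordering, the 0-based variable labels and the function name are assumptions made for this illustration: variables 0 to 8 mark black stones and variables 9 to 17 mark white stones.

```python
def encode_board(board):
    """Map a board of 9 cells ('black', 'white' or 'blank') to its set of active variables."""
    active = set()
    for cell, value in enumerate(board):
        if value == 'black':
            active.add(cell)       # first block: presence of black
        elif value == 'white':
            active.add(9 + cell)   # second block: presence of white
    return active

# Black occupies the top row, white two cells of the bottom row.
board = ['black', 'black', 'black',
         'blank', 'blank', 'blank',
         'white', 'white', 'blank']
print(sorted(encode_board(board)))
```

A blank cell activates neither of its two variables, so the three-valued original is fully recoverable from the 18 binary ones.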

Two properties of this dataset that make it particularly interesting for us here are: 

• The presence of interactions is obvious by the nature of the game. 

• There are only very weak marginal effects. Knowing that the upper right corner is
occupied by a black stone is only very weakly informative about the winner of the
game. Greedy searches by trees fail in the presence of many added noise variables and
linear models do not work well at all.

Figure 2: The probability of choosing a given pattern with Random Intersection Trees (bottom
row), Random Forests of depth 3 (middle row) and brute force search among all interactions
of size 3 (top row) for the Tic-Tac-Toe data (left panel); and the same results in the case
when 100 noise variables are added (right panel). Note that Random Intersection Trees were
not constrained to find interactions of depth 3. The area of each pattern is proportional to
the probability of being chosen. In the case with noise variables, some of the patterns with the
very smallest areas also contained a small number of noise variables, which are not shown.
Just counting three- to five-way interactions, there are more than $10^8$ potential
interactions when 100 noise variables are added.




Figure 3: From left to right: the misclassification rate (in %) on Tic-Tac-Toe data for 0,
60, 300 and 400 added noise variables. Each classifier is tuned to have equal misclassification
rate in both classes. The simple classifier based on Random Intersection Trees (RI) has a
misclassification rate of 0% in all cases, as the winning patterns are sampled very frequently
(see Figure 2). Random Forests (RF) and Random Forests limited to depth 3 trees (RF3)
are competitive but the misclassification rate increases sharply when many noise variables are
added.



We apply Random Intersection Trees to finding patterns that indicate a black win (class 1),
and also patterns that indicate a white win (class 0). We use the early stopping modifications
proposed in Section 4 and create two min-wise hash tables from the available observations
in each of the classes, taking L = 200. Figure 1 shows how the individual Intersection Trees
are constructed and illustrates the use of the early stopping rule. We emphasise that we do
not need to specify or know that the winning states are functions of only three variables. We 
let each tree run until all its branches terminate, and collect all resulting leaves. 

Figure 2 illustrates the importance sampling effect of Random Intersection Trees when
using only the training data, and adding a varying number of noise variables. When adding
100 noise variables, all 16 winning final combinations are among the 40 most frequently chosen
patterns. All winning states are chosen hundreds of millions of times more often than a random
sampling of interactions would pick them.

As discussed in Section 1, the interactions or rules that are found could be entered into
any existing aggregation method, such as Rule Ensembles [Friedman and Popescu, 2008] or
Decision Lists [Marchand and Sokolova, 2006, Rivest, 1987]. Here, we consider an even simpler
aggregation method by selecting all patterns during 1000 iterations of Random Intersection 
Trees (with B = 5 samples as branching factor in each tree) that were selected by at least two 
trees. For each selected pattern, we compute the (empirical) class distributions conditional on 
the presence and absence of the pattern, using the training sample. That is, for each selected 
pattern S, we compute 



$$\mathbb{P}_n(Y = 1 \mid S \subseteq X) \quad \text{and} \quad \mathbb{P}_n(Y = 1 \mid S \nsubseteq X).$$

Then, given an observation from the test set, we classify according to the average of the
log-odds of being in class 1 calculated from each of the conditional probabilities above.
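The aggregation rule just described can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used for the experiments: observations are represented as sets of active variable indices, the helper names are hypothetical, and the additive smoothing constant `eps` is our own addition to keep the log-odds finite when a conditional class frequency is exactly 0 or 1.

```python
import math
from typing import FrozenSet, List, Sequence, Tuple

Obs = FrozenSet[int]


def cond_probs(pattern: Obs, X: Sequence[Obs], y: Sequence[int],
               eps: float = 0.5) -> Tuple[float, float]:
    """Smoothed estimates of P_n(Y=1 | S in X) and P_n(Y=1 | S not in X)."""
    n_in = n1_in = n_out = n1_out = 0
    for xi, yi in zip(X, y):
        if pattern <= xi:          # pattern present in this observation
            n_in += 1
            n1_in += yi
        else:                      # pattern absent
            n_out += 1
            n1_out += yi
    p_in = (n1_in + eps) / (n_in + 2 * eps)
    p_out = (n1_out + eps) / (n_out + 2 * eps)
    return p_in, p_out


def avg_log_odds(x: Obs, patterns: List[Obs],
                 probs: List[Tuple[float, float]]) -> float:
    """Average, over patterns, of the log-odds of class 1 given presence/absence."""
    total = 0.0
    for S, (p_in, p_out) in zip(patterns, probs):
        p = p_in if S <= x else p_out
        total += math.log(p / (1.0 - p))
    return total / len(patterns)


# Toy training data: the pattern {0, 1} appears only in class 1.
X_train = [frozenset({0, 1}), frozenset({0, 1, 2}), frozenset({2}), frozenset({1})]
y_train = [1, 1, 0, 0]
patterns = [frozenset({0, 1})]
probs = [cond_probs(S, X_train, y_train) for S in patterns]

score_with = avg_log_odds(frozenset({0, 1, 5}), patterns, probs)
score_without = avg_log_odds(frozenset({2}), patterns, probs)
```

A test observation would then be assigned to class 1 when its averaged log-odds exceed a threshold; the threshold itself is a tuning choice not fixed by this sketch.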

Figure 3 shows the misclassification rates under situations with different numbers of added
noise variables. The simple prediction based on Random Intersection Trees achieves perfect
classification even when 400 noise variables are added. Neither k-NN nor CART [Breiman
et al., 1984], either restricted to trees of depth 3 (TREE3) or depth chosen by cross-validation
(TREE), is as successful, giving misclassification rates between 5% and 40%. Interestingly,
trees of depth 3 perform much worse than deeper trees. The winning patterns are not identified
in a pure form but only after some other variables have been factored in first. This also means
that it is very hard to read the winning states of the trees, unlike the patterns obtained by our
method. Random Forests also maintain a 0% misclassification rate up until about a hundred
added noise variables but start to degrade in performance when further noise variables are
added. It is easy to identify the noise variables from a variable importance plot [Strobl et al.,
2008]. However, within the signal variables the patterns are not easy to see since each variable
is approximately equally important for determining the winner (with the slight exception of
the middle field in the 3×3 board which is more important than the other fields) and the
nature of the interactions is thus not obvious from analysing a Random Forest fit.

5.2 Reuters RCV1 text classification 

The Reuters RCV1 text data contain the tf-idf (term frequency-inverse document frequency)
weighted presence of 47148 word-stems in each document; for details on the collection and
processing of the original data, see Lewis et al. [2004]. Each document may be assigned
more than one topic. Here we are interested in whether Random Intersection Trees is able to
give a quick and accurate summary of each topic. For each topic, we seek sets of word-stems,
S, whose simultaneous presence is indicative of a document falling within that topic.

To evaluate the performance of Random Intersections, we divide the documents into a 
training and test set with the first batch of 23149 documents as training and the following 
30000 documents as test documents. We compare our procedure to an approach based on 
Random Forests and a simple linear method. 

Random Forests and classification trees can be very time- and memory-intensive to apply 
on a dataset of the scale we consider here. In order to be able to compute Random Forests, 
we only consider word-stems if they appear in at least 100 documents in the training data. 
This leaves 2484 word-stems as predictor variables. We also only consider topics that contain 
at least 200 documents. To simplify the problem further, we consider a binary version of the 
predictor variables for all methods, using a 1 or 0 to represent whether each tf-idf value is
positive or not.
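The preprocessing described above can be sketched as follows, assuming (purely for illustration) that each document is stored as a dictionary from word-stem to tf-idf weight; the function name `binarise_and_filter` is hypothetical.

```python
from collections import Counter
from typing import Dict, FrozenSet, List, Set, Tuple


def binarise_and_filter(docs: List[Dict[str, float]],
                        min_docs: int = 100) -> Tuple[List[FrozenSet[str]], Set[str]]:
    """Binarise tf-idf weights and keep only sufficiently frequent word-stems.

    A word-stem is 'active' in a document if its tf-idf weight is positive.
    Stems active in fewer than `min_docs` documents are discarded.
    """
    active = [frozenset(s for s, w in d.items() if w > 0) for d in docs]
    doc_freq = Counter(s for a in active for s in a)
    keep = {s for s, c in doc_freq.items() if c >= min_docs}
    return [a & keep for a in active], keep


# Toy corpus; a threshold of 2 stands in for the 100 used in the text.
toy_docs = [
    {"oil": 0.3, "price": 0.1},
    {"oil": 0.2, "bank": 0.0},
    {"price": 0.5, "rate": 0.4},
]
toy_active, toy_keep = binarise_and_filter(toy_docs, min_docs=2)
```

Each document is then the set of its active predictors, which is the representation used throughout the paper.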

Let $C$ be the set of topics in our modified dataset. Let $Y \subseteq C$ indicate the topics that a
given document belongs to. Consider a topic or class $c \in C$. Our goal is to find patterns $S$
that maximise

$$\mathbb{P}_n(c \in Y \mid S \subseteq X), \tag{5.1}$$

whilst also maintaining that the prevalence of $S$ among all observations be bounded away
from 0. Specifically, we shall require that

$$\mathbb{P}_n(S \subseteq X) \geq p_c/10 \quad \text{where} \quad p_c = \mathbb{P}_n(c \in Y). \tag{5.2}$$

To see how this can be cast within the framework set in (1.1), note that if $S^*$ maximises (5.1)
and $S^{**}$ satisfies

$$\mathbb{P}_n(S^{**} \subseteq X \mid c \in Y) \geq \mathbb{P}_n(S^{*} \subseteq X \mid c \in Y) \quad \text{and} \tag{5.3}$$

$$\mathbb{P}_n(S^{**} \subseteq X \mid c \notin Y) \leq \mathbb{P}_n(S^{*} \subseteq X \mid c \notin Y), \tag{5.4}$$

then

$$\begin{aligned}
\mathbb{P}_n(c \in Y \mid S^{*} \subseteq X)
&= \frac{\mathbb{P}_n(S^{*} \subseteq X \mid c \in Y)\,\mathbb{P}_n(c \in Y)}{\mathbb{P}_n(S^{*} \subseteq X \mid c \in Y)\,\mathbb{P}_n(c \in Y) + \mathbb{P}_n(S^{*} \subseteq X \mid c \notin Y)\,\mathbb{P}_n(c \notin Y)} \\
&\leq \frac{\mathbb{P}_n(S^{**} \subseteq X \mid c \in Y)\,\mathbb{P}_n(c \in Y)}{\mathbb{P}_n(S^{**} \subseteq X \mid c \in Y)\,\mathbb{P}_n(c \in Y) + \mathbb{P}_n(S^{**} \subseteq X \mid c \notin Y)\,\mathbb{P}_n(c \notin Y)} \\
&= \mathbb{P}_n(c \in Y \mid S^{**} \subseteq X),
\end{aligned}$$

whence $S^{**}$ also maximises (5.1) by optimality of $S^{*}$. Thus treating those documents
belonging to topic $c$ as class 1, and all others as class 0, by solving (1.1) with $\theta_0$ and
$\theta_1$ chosen appropriately, we can obtain all solutions to (5.1).

Figure 4: The misclassification rate $\mathbb{P}_n(c \notin Y \mid S \subseteq X)$ on the test data for patterns $S$
chosen with a tree ensemble node generation mechanism (black circle), Random Intersections
(white circle), and a linear method (black triangle) for topics $c \in C$ in the Reuters RCV1 text
classification data. The topics are shown on the left and the word combinations chosen by
Random Intersection Trees on the right.



In view of this, we use each of the methods to search for patterns S that have high
prevalence for a given topic c. We then remove all patterns that do not satisfy (5.2) on the
test data. Then, from the remaining patterns, we select the one that maximises (5.1) on
training data. Below, we describe specific implementation details of each of the methods
under consideration.
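The screening and selection step can be sketched as below. The function names and the toy data are hypothetical, documents are again represented as sets of active word-stems (here integer indices), and the choice to return `None` when no candidate survives screening is our own convention.

```python
from typing import FrozenSet, List, Optional, Sequence

Obs = FrozenSet[int]


def prevalence(S: Obs, X: Sequence[Obs]) -> float:
    """Empirical P_n(S is contained in X)."""
    return sum(S <= x for x in X) / len(X)


def precision(S: Obs, X: Sequence[Obs], in_topic: Sequence[int]) -> float:
    """Empirical P_n(c in Y | S contained in X); 0.0 if S never occurs."""
    hits = [t for x, t in zip(X, in_topic) if S <= x]
    return sum(hits) / len(hits) if hits else 0.0


def select_pattern(candidates: List[Obs],
                   X_train: Sequence[Obs], t_train: Sequence[int],
                   X_screen: Sequence[Obs], t_screen: Sequence[int]) -> Optional[Obs]:
    """Screen candidates with (5.2), then maximise (5.1) on the training data."""
    p_c = sum(t_screen) / len(t_screen)
    kept = [S for S in candidates if prevalence(S, X_screen) >= p_c / 10]
    if not kept:
        return None
    return max(kept, key=lambda S: precision(S, X_train, t_train))


# Toy example with three candidate patterns.
X_tr = [frozenset({1, 2}), frozenset({1, 2}), frozenset({1}), frozenset({3})]
t_tr = [1, 1, 0, 0]
X_sc = [frozenset({1, 2}), frozenset({1}), frozenset({3}), frozenset({1, 2, 3})]
t_sc = [1, 0, 0, 1]
cands = [frozenset({1}), frozenset({1, 2}), frozenset({4})]
best = select_pattern(cands, X_tr, t_tr, X_sc, t_sc)
```

In this toy example the pattern {4} is removed by the prevalence constraint, and {1, 2} is preferred to {1} because every training document containing it belongs to the topic.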

Random Intersection Trees. We create the min-wise hash table for the prevalence among
all samples once, using 200 permutations with associated min-wise hash values for each word-
stem. Then 1000 iterations of the tree search are performed with a cutoff value $\theta_0 = (3/20)p_c$
and all remaining patterns S of length at most 4 are retained.



Random Forests. For a tree-based procedure, one approach is to fit classification trees
on subsampled data, adding randomness in the variable selection as in Random Forests
[Breiman, 2001], and then to look among all created leaf nodes for the most suitable node.

We generate 100 trees as in the Random Forests method: each is fit to subsampled training
data using the CART algorithm restricted to depth 4, and further randomness is injected by only
permitting variables to be selected from a random subset of those available, for each tree.
This takes on average between 90% and 110% of the computational time of a non-optimised
pure R [R Development Core Team, 2005] implementation of Random Intersection Trees for
these data. Note that this is when using the Fortran version of Breiman [2001] for the
Random Forests node generation; we expect a significant speedup if Fortran or C code were
used for Random Intersection Trees. We are currently working on such a version and plan to
make it available soon. Furthermore, Random Forests would scale much worse if many more
word-stems were included as variables.



Linear models. For linear models, we fit a sparse model with at most $\ell$ predictors (with
$\ell \leq 4$), using a logistic model with an $\ell_1$-penalty [Tibshirani, 1996, Friedman et al., 2010].
We constrain the regression coefficients to be positive since we are only looking for positive
associations in the two previously discussed approaches, and want to keep the same inter-
pretability for the linear model. For each value of $\ell \leq 4$, we take $S_\ell$ to be the set of variables
with a positive regression coefficient. We select the largest value of $\ell$ such that the fraction
of documents attaining the maximal value is at least $p_c/10$ and select the associated pattern
$S_\ell$. (An alternative approach would be to retain the documents with the highest predicted
value when using a sparse regression fit. This approach gave very similar results.)



After screening the candidate patterns returned by each of the methods using (5.2) on
all of the topics $c \in C$, we evaluate the misclassification rate $\mathbb{P}_n(c \notin Y \mid S \subseteq X)$ on the test
data. The results for all of the topics are shown in Figure 4. The rules found with Random
Intersection Trees have a smaller loss than those found with Random Forests in all but 5 of
the topics. For those topics where Random Forests performs better, the difference in loss is
typically small. Linear models achieve a smaller loss than Random Forests among most of the
topics, but only have a smaller loss than Random Intersection Trees in 6 topics, performing
worse in all remaining 46 topics.

6 Discussion 

We have proposed Random Intersection Trees as an efficient way of finding interesting interac- 
tions. In contrast to more established algorithms, the patterns are not built up incrementally 
by adding variables to create interactions of greater and greater size. Instead we start from 
the full interaction S = {1, . . . ,p} and remove more and more variables from this set by tak- 
ing intersections with randomly chosen observations. Arranging the search in a tree increases 
efficiency by exploiting sparsity in the data. For the basic version of our method (Algorithm 
1), we were able to derive a bound on the computational complexity. The bound depends 
on (a) the prevalence or frequency with which the pattern S appears among observations 
in class 1, and (b) the overall sparsity of the data, with higher sparsity making it easier to 
detect the interaction using a given computational budget. In the best case, we can achieve 
an almost linear complexity bound as a function of p; more generally our complexity bound
typically has a smaller exponent than that for a brute force search. Further improvements can 
be made by using min-wise hashing techniques to terminate parts of the search (i.e. branches 
of the Intersection Tree) that have no chance of leading to interesting interactions. Numerical 
examples illustrate the improved interaction detection power of Random Intersection Trees 
over other tree-based methods and linear models. 
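To make the intersection idea concrete, the following is a deliberately simplified sketch of a single tree. It is not Algorithm 1 itself: the branching factor is fixed, there is no early stopping via min-wise hashing, no screening against class 0, and the function name and data are hypothetical.

```python
import random
from typing import FrozenSet, List, Sequence

Obs = FrozenSet[int]


def random_intersection_tree(class1: Sequence[Obs], depth: int,
                             branching: int, rng: random.Random) -> List[Obs]:
    """Return the leaf sets of one tree of the given depth.

    The root is a randomly chosen class-1 observation. Each child is the
    intersection of its parent with another randomly chosen class-1
    observation, so variables can only be removed as we descend.
    """
    level = [rng.choice(class1)]
    for _ in range(depth):
        next_level = []
        for node in level:
            for _ in range(branching):
                child = node & rng.choice(class1)
                if child:  # an empty set cannot contain any interaction
                    next_level.append(child)
        level = next_level
    return level


# Toy class-1 data in which the interaction {0, 1} is present in every observation.
S = frozenset({0, 1})
class1 = [frozenset({0, 1, 2}), frozenset({0, 1, 3}),
          frozenset({0, 1, 4}), frozenset({0, 1, 2, 5})]
leaves = random_intersection_tree(class1, depth=3, branching=2, rng=random.Random(0))
```

Because every class-1 observation here contains {0, 1}, every leaf must contain it as well, whichever observations are drawn; in real data an interaction with prevalence below one survives only along some branches, which is why many trees are grown.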



There are many diverse ways in which interactions that solve (1.1) can be used in further
analysis. The interactions may be of interest in their own right as shown in both numerical
examples. One can also simply use the search to make sure that a dataset is unlikely to have
strong interactions that could otherwise have been missed. If the aim is to build a classifier,
they can be added to a linear model, or built into classifiers based on tree ensembles. For
the latter approach one could consider, for example, averaging predictions in a linear way or
averaging log-odds as in Random Ferns [Bosch et al., 2007]. We believe developments along
these lines will prove to be fruitful directions for future research. We also plan to generalise
the idea to categorical and continuous predictor variables.

7 Appendix

Proof of Theorem 1. Fix a tree $m \in \{1, \ldots, M\}$ and suppose this has node set
$N = \{1, \ldots, J\}$ indexed chronologically (see Section 2). For $d \in \{1, \ldots, D\}$, define

$$N_d = \{j \in N : \mathrm{depth}(j) = d \text{ and } S_j \supseteq S\}, \qquad W_d = |N_d|.$$

Let $E$ be the event that $S$ is contained in $S_1$, the random sample selected for the root node of
tree $m$. Further, let $G_d(t) = \mathbb{E}(t^{W_d} \mid E)$, the probability generating function of $W_d$ conditional
on the event $E$.

We make a few simple observations from the theory of branching processes. Firstly, for
$d \leq D - 1$, $G_{d+1} = G_d \circ G$ where $G := G_1$. To see this, first note that

$$W_{d+1} = \sum_{j \in N_d} \sum_{j' \in \mathrm{ch}(j)} 1_{\{S \subseteq X_{i_{j'}}\}}.$$

Now conditional on $E$, the random variables $\sum_{j' \in \mathrm{ch}(j)} 1_{\{S \subseteq X_{i_{j'}}\}}$ for $j \in N_d$ are independent
of $N_d$. Moreover, they are independent of each other and have identical distributions equal
to that of

$$\sum_{j' \in \mathrm{ch}(1)} 1_{\{S \subseteq X_{i_{j'}}\}} = W_1.$$

This entails

$$\mathbb{E}(t^{W_{d+1}} \mid W_d = w, E) = \{\mathbb{E}(t^{W_1} \mid E)\}^{w} = \{G(t)\}^{w}.$$

Thus

$$G_{d+1}(t) = \mathbb{E}\big(\mathbb{E}(t^{W_{d+1}} \mid W_d, E) \mid E\big) = \mathbb{E}\big(\{G(t)\}^{W_d} \mid E\big) = G_d(G(t)),$$

as claimed.

From this we can conclude that if $G$ has a fixed point $q \in (0, 1)$, then this must be a
fixed point for all $G_d$. Since each $G_d$ is non-decreasing, we have that for all $d \in \mathbb{N}$, if $q' \leq q$
then $G_d(q') \leq q$. The relevance of these remarks will become clear from the following: for
an $S' \in L_{D,m}$, we have

$$\begin{aligned}
G_D\big(\mathbb{P}(S' \supsetneq S \mid S' \supseteq S)\big)
&= \sum_{\ell=0}^{\infty} \mathbb{P}(W_D = \ell \mid E)\, \mathbb{P}(S' \supsetneq S \mid S' \supseteq S)^{\ell} \\
&= \sum_{\ell=0}^{\infty} \mathbb{P}\big(\{W_D = \ell\} \cap \{S \notin L_{D,m}\} \mid E\big) \\
&= \mathbb{P}(S \notin L_{D,m} \mid E).
\end{aligned}$$

Thus if we can ensure that $\mathbb{P}(S' \supsetneq S \mid S' \supseteq S)$ is at most $q$, then the final probability in the
above display will also be at most $q$.

To get an upper bound for $\mathbb{P}(S' \supsetneq S \mid S' \supseteq S)$, we argue as follows. The set $S'$ is the
intersection of $D + 1$ observations selected independently of one another. In order for some
$k \in S^c$ to be contained in $S'$, it must have been present in all these $D + 1$ observations. Thus
by the union bound we have

$$\mathbb{P}(S' \supsetneq S \mid S' \supseteq S) \leq \sum_{k \in S^c} \mathbb{P}(k \in S' \mid S' \supseteq S) \leq p\nu^{D+1},$$

the rightmost inequality following from (A2).

Now let the distribution of $B$ be such that

$$B = \begin{cases} b & \text{with probability } 1 - \alpha, \\ b + 1 & \text{with probability } \alpha. \end{cases}$$

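Note that

$$\mathbb{E}(B^d) = \sum_{\ell=0}^{d} \binom{d}{\ell} \{b(1 - \alpha)\}^{\ell} \{(b + 1)\alpha\}^{d-\ell} = (b + \alpha)^d.$$

Using this, we see that the expected computational complexity of the algorithm is bounded
above by

$$\log(p)\, M \sum_{k=1}^{p} \Big[ (b + \alpha)\delta_k + \cdots + \{(b + \alpha)\delta_k\}^{D} \Big]
\leq \log(p)\, M D \Big[ p + \sum_{k : (b+\alpha)\delta_k > 1} \big( \{(b + \alpha)\delta_k\}^{D} - 1 \big) \Big]. \tag{7.1}$$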


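Now given a $q \in (0, 1)$, we shall pick $b \in \mathbb{N}$ and $\alpha \in [0, 1)$ to satisfy $G(q) = q$. To this
end, observe that

$$G(q) = (1 - \alpha)\{1 - \theta_1(1 - q)\}^{b} + \alpha\{1 - \theta_1(1 - q)\}^{b+1}.$$

Thus $\alpha$ and $b$ must satisfy

$$\begin{aligned}
b + \alpha &= \frac{\log(q) - \log\{1 - \alpha\theta_1(1 - q)\}}{\log\{1 - \theta_1(1 - q)\}} + \alpha \\
&= \frac{-\log(q) + \log\{1 - \alpha\theta_1(1 - q)\}}{-\log\{1 - \theta_1(1 - q)\}} + \alpha \\
&\leq \frac{-\log(q)}{\theta_1(1 - q)} \\
&\leq \frac{1}{\theta_1}\Big(1 + \frac{1 - q}{2q}\Big).
\end{aligned} \tag{7.2}$$

In the final line, we used the inequality

$$\log(z) \geq (z - 1) - \frac{(z - 1)^2}{2z}, \qquad 0 < z < 1.$$

Next, we pick $D$ to be the minimum $D$ such that $p\nu^{D+1} \leq q$, so

$$D = \left\lceil \frac{\log(p/q)}{\log(1/\nu)} \right\rceil - 1, \tag{7.3}$$

and in particular

$$D \leq \frac{\log(p/q)}{\log(1/\nu)}. \tag{7.4}$$

Finally, note that the probability of recovering $S$ is

$$1 - \big[1 - \{1 - \mathbb{P}(S \notin L_{D,m} \mid E)\}\theta_1\big]^{M}.$$

Given the choices of $\alpha$ and $b$ (7.2), and $D$ (7.3), we have that $\mathbb{P}(S \notin L_{D,m} \mid E) \leq q$. Thus
taking $M$ to be at least

$$\frac{-\log(\eta)}{(1 - q)\theta_1} \geq \frac{\log(\eta)}{\log\{1 - (1 - q)\theta_1\}} \tag{7.5}$$

guarantees recovery of $S$ with probability at least $1 - \eta$. Substituting equations (7.2), (7.4)
and (7.5) into the complexity bound (7.1), and writing $\epsilon = (1 - q)/(2q)$ gives a bound for the
computational complexity of

$$\log(p)\, \frac{\log(1/\eta)}{\theta_1}\, \frac{1 + 2\epsilon}{2\epsilon}\, \frac{\log\{p(1 + 2\epsilon)\}}{\log(1/\nu)}
\Big[ p + \sum_{k : (1+\epsilon)\delta_k > \theta_1} \Big( \{p(1 + 2\epsilon)\}^{\log\{(1+\epsilon)\delta_k/\theta_1\}/\log(1/\nu)} - 1 \Big) \Big].$$

Given that $\epsilon$ is bounded above, removing constant factors not depending on $p$, we get that
the order of the computational complexity is bounded above by

$$\log^2(p) \Big[ p + \sum_{k : (1+\epsilon)\delta_k > \theta_1} \Big( p^{\log\{(1+\epsilon)\delta_k/\theta_1\}/\log(1/\nu)} - 1 \Big) \Big].$$

□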
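The parameter choices made in the proof can be checked numerically. The sketch below is only an illustration: the function names are hypothetical, the input values are arbitrary, $\nu$ denotes the constant from (A2), and we assume the inputs are such that $D \geq 0$. It solves $G(q) = q$ for $b$ and $\alpha$, takes $D$ as in (7.3), and takes $M$ as the smallest integer exceeding the right-hand side of (7.5).

```python
import math
from typing import Tuple


def proof_parameters(q: float, theta1: float, p: int,
                     nu: float, eta: float) -> Tuple[int, float, int, int]:
    """Return (b, alpha, D, M) following the choices made in the proof."""
    x = theta1 * (1.0 - q)
    # G(q) = (1 - x)^b * (1 - alpha * x) = q with b integer and alpha in [0, 1).
    b = math.floor(math.log(q) / math.log(1.0 - x))
    alpha = (1.0 - q / (1.0 - x) ** b) / x
    # Smallest D with p * nu^(D + 1) <= q, as in (7.3).
    D = math.ceil(math.log(p / q) / math.log(1.0 / nu)) - 1
    # Smallest integer M with M >= log(eta) / log(1 - (1 - q) * theta1).
    M = math.ceil(math.log(eta) / math.log(1.0 - x))
    return b, alpha, D, M


def G(t: float, b: int, alpha: float, theta1: float) -> float:
    """Generating function of W_1 given E when B is b or b + 1."""
    base = 1.0 - theta1 * (1.0 - t)
    return (1.0 - alpha) * base ** b + alpha * base ** (b + 1)


q, theta1, p, nu, eta = 0.5, 0.2, 1000, 0.1, 0.05
b, alpha, D, M = proof_parameters(q, theta1, p, nu, eta)
eps = (1.0 - q) / (2.0 * q)
```

With these inputs the fixed-point equation holds to machine precision, $b + \alpha$ respects the bound $(1 + \epsilon)/\theta_1$ from (7.2), and $1 - \{1 - (1 - q)\theta_1\}^{M} \geq 1 - \eta$.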



Proof of Corollary 2. Note that



is bounded by 



p log(l/l/) 

fc:(l+e)4>9l 



The result then follows using the scaling of $p \log^2(p)$ and the possibility of making $\epsilon$ arbitrarily
small in Theorem 1. □



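Derivation of (4.2). Writing $r = n\pi_2(S)$, we have

$$\begin{aligned}
\binom{n}{r} \mathbb{E}_\sigma\Big(\min_{k \in S} h_\sigma(k)\Big)
&= \sum_{\ell=1}^{n-r+1} \ell \binom{n - \ell}{r - 1} \\
&= \sum_{\ell=1}^{n-r+1} \Big\{ (\ell - 1)\binom{n - (\ell - 1)}{r} - \ell \binom{n - \ell}{r} \Big\} + \sum_{\ell=1}^{n-r+1} \binom{n - \ell + 1}{r}.
\end{aligned}$$

The first two terms sum to zero leaving only the final term. Thus

$$\binom{n}{r} \mathbb{E}_\sigma\Big(\min_{k \in S} h_\sigma(k)\Big)
= \sum_{\ell=1}^{n-r+1} \Big\{ \binom{n - \ell + 2}{r + 1} - \binom{n - \ell + 1}{r + 1} \Big\}
= \binom{n + 1}{r + 1}, \tag{7.6}$$

whence

$$\mathbb{E}_\sigma\Big(\min_{k \in S} h_\sigma(k)\Big) = \frac{n + 1}{r + 1}. \qquad \square \tag{7.7}$$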
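The identity (7.7) can be verified exactly for small cases. The check below rests on one reading of the setup, stated here as an assumption: under a uniformly random permutation $\sigma$, the quantity $\min_{k \in S} h_\sigma(k)$ is distributed as the minimum of a uniformly random $r$-subset of $\{1, \ldots, n\}$, so that it equals $\ell$ with probability $\binom{n-\ell}{r-1} / \binom{n}{r}$. The function name is hypothetical.

```python
from fractions import Fraction
from math import comb


def expected_min(n: int, r: int) -> Fraction:
    """Exact expectation of the minimum of a uniform random r-subset of {1..n}."""
    # The minimum equals l when l is chosen and the other r - 1 elements
    # are drawn from the n - l values larger than l.
    total = sum(l * comb(n - l, r - 1) for l in range(1, n - r + 2))
    return Fraction(total, comb(n, r))


checks = [(5, 1), (10, 3), (50, 7), (200, 200)]
results = {(n, r): expected_min(n, r) for n, r in checks}
```

Exact rational arithmetic is used so that agreement with $(n+1)/(r+1)$ is an equality rather than a floating-point approximation.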

Proof of Theorem 4. Writing

k 2 \L-S,H) := i^minH, 



fees 



and suppressing dependence on $S$ and $H$, we have



7T17T2 — 7Ti?T2 = 7TlvT2 



n + 1 — 7r 2 )^i 

= r^- i (ti - Ti) - Ti r tt 2 7 f • (7.8) 

Consider $L \to \infty$. By the weak law of large numbers and the continuous mapping theorem,
we have

n + 1 — 7r7 (L) p 

A 7T 2 and 

nff 2 (L) 

717T2 + 1 p (7T 2 + n -1 ) 2 



n + l-^^L) ^(l + n- 1 )' 
By the central limit theorem, Slutsky's lemma and Lemma 5,

A L := v<L(7ri(i) - 7ri) A JV(0,7ri(l-7ri)) and 

5 L := -TTx + * x VI (V(X) - -^±L > ) A AT(0, vr 2 (l - vr 2 ) + o n (l)). 

n + 1 - 7r 2 1 (L) V n7r 2 + 1 / 

Now 7Ti and iT2~ 1 are independent, so and Bl are independent. Thus we have that for all 
E^Ai+taBiJj = E(e itlAL )E(e it2BL ) -> exp{|(i 2 7n(l - tq) + i|(7r 2 (l - vr 2 ) + o n (l))}. 



pointwise as $L \to \infty$. Returning to (7.8), by Lévy's continuity theorem we have



^7V(0,7T^vTl(l-7ri^2) + On(l)). □ 

Lemma 5. Let $r = n\pi_2(S)$ and suppose $n > r + 2$. Then

(rn 2 — 3r 2 n + 2r 3 ) + (rn + 4n + 5r 2 + 4r + 2) 



Var CT (min h cr (k)) 



fees v /y (r + l) 2 (r + 2) 
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n-l 
r — 1 



Proof. We have, 

)E CT {(minM£;)) 2 }= £ 

= £ {«-d 

n— r+1 ^ 



n— r+1 



r 



n-* + l\ / n -£ + l 
r / V r 



n 
r + 2 



+ 



n + 1 
r + 1 



where in the last line we used (7.6) and (7.7). From this, we get that

(rn 2 — 3r 2 n + 2r 3 ) + (rn + 4n + 5r 2 + 4r + 2) 



Var CT (min h a (k)) 
fees 



(r + l) 2 (r + 2) 
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